BACKGROUND - NO CODING ZONE

(NOT SO) HYPOTHETICAL SCENARIO

You are asked by your stakeholders to do some analysis. You spend a lot of time in Excel and PowerPoint putting together this analysis and the accompanying presentation. After you present, your stakeholders ask to see the same analysis but by different cuts. This seems like a simple request, but it actually involves re-running the entire process. This might happen several times in a project.

With the tools we are learning in R, we will be able to greatly reduce this type of back and forth. Not only can R automate this iterative process, but it can enhance the ultimate deliverable.

Last week we looked at how dplyr can help with our data prep. Today we will be focusing on 2 new packages for data visualization: ggplot2 and plotly.

Ultimately in lesson 4, Shiny will do an even better job of analyzing our data. Shiny is really just a wrapper around everything we are doing in lessons 1-3, so we need to learn these extremely useful basics (even outside of Shiny) first.

BASIC GGPLOT2 GRAPH:

Graphs created with ggplot2 use the following form:

ggplot(data,aes(x,y)) + geom_XXX()

DECONSTRUCT THE LINE ABOVE

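  • data: the data frame that holds the values to plot.
  • aes(x,y): the aesthetics, which map columns of data to the x and y axes.
  • geom_XXX(): the geometry, which sets the graph type, such as geom_point() for a scatter plot, geom_line() for a line graph, or geom_bar() for a bar graph.

We will only look at these 3 graph types in today’s lesson. There are several others (boxplots, density plots, heatmaps, and more). All graph types follow the same general structure.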

Use an object oriented approach to make iterations easier: store the data and aesthetics in an object once, then add different geometries to that object.

object.name <- ggplot(data,aes(x,y))

You can quickly create 3 ways of visualizing this relationship between x and y:
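As a minimal sketch of this pattern, the block below builds one base object and adds three different geometries to it. The data frame toy and its columns band and ratio are made-up placeholders for illustration, not the lesson's mortality data.

```r
library(ggplot2)

# Hypothetical toy data, not the lesson's mortality file
toy <- data.frame(
  band  = c("20-29", "30-39", "40-49", "50-59"),
  ratio = c(0.92, 1.05, 0.98, 1.10)
)

# Data and aesthetics are defined once; group = 1 lets geom_line()
# connect the points across a discrete x axis
base <- ggplot(toy, aes(x = band, y = ratio, group = 1))

# Each graph type is the same base object plus one geometry
scatter <- base + geom_point()
line    <- base + geom_line()
bar     <- base + geom_bar(stat = "identity")
```

Because the base object is reused, changing the data or the aesthetics in one place updates all three graphs.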

All this will make more sense with some examples. Let’s begin our script.

WEEK 1 HOMEWORK

Last week you were asked to group our dataset by some specific cohorts. This week we will visualize the results. A sample solution along with additional supporting files can be found linked in the comment section below.

SUMMARY OF WEEK 1 HOMEWORK REQUESTS

  • TASK 1: Calculate actual to expected mortality by banded age.
  • TASK 2: Calculate actual to expected mortality by banded age by gender.
  • TASK 3: Calculate actual to expected mortality by banded age by gender by smoker status.

This was to be done for all 5 mortality tables, on a count AND dollar weighted basis.
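To make the calculation concrete, here is a small sketch assuming the usual definition of the ratio: total actual deaths divided by total expected deaths within each group. The column names actual and expected and all of the numbers are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data: two records per age band
toy <- data.frame(
  band     = c("20-29", "20-29", "30-39", "30-39"),
  actual   = c(2, 1, 4, 2),          # observed deaths
  expected = c(2.5, 1.5, 3.0, 2.0)   # deaths predicted by a mortality table
)

# Sum actual and expected within each band first, then take the ratio
ae <- toy %>%
  group_by(band) %>%
  summarise(actual = sum(actual), expected = sum(expected)) %>%
  mutate(atoe = actual / expected)
```

Summing before dividing gives a ratio of totals, which is generally not the same as averaging the record-level ratios. A dollar weighted version would follow the same steps with claim amounts in place of counts.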

SAMPLE SOLUTION

This includes code for an additional 4th task that we will discuss at the end of this lesson.

  • TASK 4: Calculate actual to expected mortality by banded age by gender by smoker status by preferred status.

library(dplyr)
library(data.table)
library(ggplot2)
library(plotly)
library(gridExtra)
library(grid)



data.mortality <- fread('mortality.csv')
data.agebands <- read.csv('age_bandings.csv')


data.grouped1 <- data.mortality %>%
  inner_join(data.agebands, by = 'attained.age') %>%
  group_by(banded.age) %>%
  summarise_at(vars(Deaths,Death.Dollar,qx7580E.amount,qx7580E.policy,qx2001VBT.amount,
                    qx2001VBT.policy,qx2008VBT.amount,qx2008VBT.policy,qx2008VBTLU.amount,
                    qx2008VBTLU.policy,qx2015VBT.amount,qx2015VBT.policy),list(sum)) %>%
  # actual / expected: dollar weighted (d) first, then count weighted (c)
  mutate(atoe7580d = Death.Dollar/qx7580E.amount,
         atoe01VBTd = Death.Dollar/qx2001VBT.amount,
         atoe08VBTd = Death.Dollar/qx2008VBT.amount,
         atoe08VBTLUd = Death.Dollar/qx2008VBTLU.amount,
         atoe15VBTd = Death.Dollar/qx2015VBT.amount,
         atoe7580c = Deaths/qx7580E.policy,
         atoe01VBTc = Deaths/qx2001VBT.policy,
         atoe08VBTc = Deaths/qx2008VBT.policy,
         atoe08VBTLUc = Deaths/qx2008VBTLU.policy,
         atoe15VBTc = Deaths/qx2015VBT.policy) %>%
  select(banded.age,atoe7580d,atoe01VBTd,atoe08VBTd,atoe08VBTLUd,atoe15VBTd,
         atoe7580c,atoe01VBTc,atoe08VBTc,atoe08VBTLUc,atoe15VBTc)


data.grouped2 <- data.mortality %>%
  inner_join(data.agebands, by = 'attained.age') %>%
  group_by(banded.age,Gender) %>%
  summarise_at(vars(Deaths,Death.Dollar,qx7580E.amount,qx7580E.policy,qx2001VBT.amount,
                    qx2001VBT.policy,qx2008VBT.amount,qx2008VBT.policy,qx2008VBTLU.amount,
                    qx2008VBTLU.policy,qx2015VBT.amount,qx2015VBT.policy),list(sum)) %>%
  # actual / expected: dollar weighted (d) first, then count weighted (c)
  mutate(atoe7580d = Death.Dollar/qx7580E.amount,
         atoe01VBTd = Death.Dollar/qx2001VBT.amount,
         atoe08VBTd = Death.Dollar/qx2008VBT.amount,
         atoe08VBTLUd = Death.Dollar/qx2008VBTLU.amount,
         atoe15VBTd = Death.Dollar/qx2015VBT.amount,
         atoe7580c = Deaths/qx7580E.policy,
         atoe01VBTc = Deaths/qx2001VBT.policy,
         atoe08VBTc = Deaths/qx2008VBT.policy,
         atoe08VBTLUc = Deaths/qx2008VBTLU.policy,
         atoe15VBTc = Deaths/qx2015VBT.policy) %>%
  select(banded.age,Gender,atoe7580d,atoe01VBTd,atoe08VBTd,atoe08VBTLUd,atoe15VBTd,
         atoe7580c,atoe01VBTc,atoe08VBTc,atoe08VBTLUc,atoe15VBTc)




data.grouped3 <- data.mortality %>%
  inner_join(data.agebands, by = 'attained.age') %>%
  group_by(banded.age,Gender,smoker) %>%
  summarise_at(vars(Deaths,Death.Dollar,qx7580E.amount,qx7580E.policy,qx2001VBT.amount,
                    qx2001VBT.policy,qx2008VBT.amount,qx2008VBT.policy,qx2008VBTLU.amount,
                    qx2008VBTLU.policy,qx2015VBT.amount,qx2015VBT.policy),list(sum)) %>%
  # actual / expected: dollar weighted (d) first, then count weighted (c)
  mutate(atoe7580d = Death.Dollar/qx7580E.amount,
         atoe01VBTd = Death.Dollar/qx2001VBT.amount,
         atoe08VBTd = Death.Dollar/qx2008VBT.amount,
         atoe08VBTLUd = Death.Dollar/qx2008VBTLU.amount,
         atoe15VBTd = Death.Dollar/qx2015VBT.amount,
         atoe7580c = Deaths/qx7580E.policy,
         atoe01VBTc = Deaths/qx2001VBT.policy,
         atoe08VBTc = Deaths/qx2008VBT.policy,
         atoe08VBTLUc = Deaths/qx2008VBTLU.policy,
         atoe15VBTc = Deaths/qx2015VBT.policy) %>%
  select(banded.age,Gender,smoker,atoe7580d,atoe01VBTd,atoe08VBTd,atoe08VBTLUd,atoe15VBTd,
         atoe7580c,atoe01VBTc,atoe08VBTc,atoe08VBTLUc,atoe15VBTc)


data.grouped4 <- data.mortality %>%
  inner_join(data.agebands, by = 'attained.age') %>%
  group_by(banded.age,Gender,smoker,Preferred) %>%
  summarise_at(vars(Deaths,Death.Dollar,qx7580E.amount,qx7580E.policy,qx2001VBT.amount,
                    qx2001VBT.policy,qx2008VBT.amount,qx2008VBT.policy,qx2008VBTLU.amount,
                    qx2008VBTLU.policy,qx2015VBT.amount,qx2015VBT.policy),list(sum)) %>%
  # actual / expected: dollar weighted (d) first, then count weighted (c)
  mutate(atoe7580d = Death.Dollar/qx7580E.amount,
         atoe01VBTd = Death.Dollar/qx2001VBT.amount,
         atoe08VBTd = Death.Dollar/qx2008VBT.amount,
         atoe08VBTLUd = Death.Dollar/qx2008VBTLU.amount,
         atoe15VBTd = Death.Dollar/qx2015VBT.amount,
         atoe7580c = Deaths/qx7580E.policy,
         atoe01VBTc = Deaths/qx2001VBT.policy,
         atoe08VBTc = Deaths/qx2008VBT.policy,
         atoe08VBTLUc = Deaths/qx2008VBTLU.policy,
         atoe15VBTc = Deaths/qx2015VBT.policy) %>%
  select(banded.age,Gender,Preferred,smoker,atoe7580d,atoe01VBTd,atoe08VBTd,atoe08VBTLUd,
         atoe15VBTd,atoe7580c,atoe01VBTc,atoe08VBTc,atoe08VBTLUc,atoe15VBTc)

  • The only function not covered last week is summarise_at().
  • It is simpler to use than summarise() when summarising multiple columns.
  • I found it with Google; I had not used it before doing my homework.
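For readers on a recent version of dplyr, note that summarise_at() still works but is marked as superseded in favor of summarise() combined with across(). The sketch below compares the two on a made-up data frame; toy and its columns are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  band     = c("a", "a", "b"),
  deaths   = c(1, 2, 3),
  expected = c(1.5, 2.5, 2.0)
)

# Style used in the sample solution
old_style <- toy %>%
  group_by(band) %>%
  summarise_at(vars(deaths, expected), list(sum))

# Equivalent form with across() (dplyr 1.0.0 or later)
new_style <- toy %>%
  group_by(band) %>%
  summarise(across(c(deaths, expected), sum))
```

Both calls return one row per band with the summed columns, so either style can feed the mutate() step that follows.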

We will focus on atoe7580d, the dollar weighted ratio using the 7580E table. This choice is arbitrary; the goal is simply to demonstrate various graphs. We could easily swap our target variable for any of the other actual to expected ratios we want to examine.

For next week’s lesson we will look more closely at the relationship among these 10 actual to expected calculation methodologies. This will be an opportunity to look at gather(), a tidyr function (since superseded by pivot_longer()) that is very useful in data visualization with ggplot().

The snippet above was saved as Week2.R.

TASK 1: TARGET VARIABLE BY SINGLE EXPLANATORY VARIABLE

This is a very common and simple request. We have an example from last week’s homework:

TASK 1: Actual to Expected mortality rates by banded age.

This is a simple relationship - which means it is a good introduction to ggplot2 - even if the graphs are rather boring.

First step: Create your ggplot object for the exercise.

graph.task1 <- ggplot(data.grouped1,aes(x = banded.age, y = atoe7580d, group = 1))
scatter1 <- graph.task1 + geom_point()
line1 <- graph.task1 + geom_line()
bar1 <- graph.task1 + geom_bar(stat = "identity")
scatterline1 <- scatter1 + geom_line()

#scatter1
#bar1
#line1
#scatterline1
scatter1

bar1

line1

scatterline1

Running the code in an R Script will also demonstrate how graphing objects render in R.
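One detail worth knowing when moving to a script: typing an object's name at the console draws it, but inside source(), a loop, or a function the plot has to be printed explicitly. The sketch below illustrates this with made-up data and also saves the graph to a temporary PDF with ggsave(); the file path is a placeholder.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(x = 1:3, y = c(2, 4, 3))

# Creating the object does not draw anything yet
p <- ggplot(toy, aes(x, y)) + geom_point()

# Explicit print() draws the plot, even inside source(), loops, or functions
print(p)

# Save the plot to a file instead of the screen (temporary path for illustration)
out <- tempfile(fileext = ".pdf")
ggsave(out, plot = p, width = 6, height = 4)
```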

TASK 2: TARGET VARIABLE BY 2 EXPLANATORY VARIABLES

As always, our first step is to create our ggplot object.

graph.task2 <- ggplot(data.grouped2,aes(x = banded.age, y = atoe7580d, color = Gender, fill = Gender))

Once we build our ggplot object, adding our geometries is the same as in TASK 1.

scatter2 <- graph.task2 + geom_point()
line2 <- graph.task2 + geom_line()
stackbar2 <- graph.task2 + geom_bar(stat = "identity",position = "stack")
fillbar2 <- graph.task2 + geom_bar(stat = "identity",position = "fill")
dodgebar2 <- graph.task2 + geom_bar(stat = "identity",position = "dodge")
scatterline2 <- scatter2 + geom_line()

#scatter2
#line2
#stackbar2
#fillbar2
#dodgebar2
#scatterline2

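Notice we now have additional bar graphs. We are looking at 3 different types of bar graphs, which are differentiated by the “position” argument in the geom_bar() function:

  • position = "stack": the bars for each Gender are stacked on top of one another.
  • position = "fill": the bars are stacked, then rescaled so that every stack has a total height of 1.
  • position = "dodge": the bars for each Gender are placed side by side.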
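A quick way to see what each position does is to inspect the bar heights that ggplot2 computes. The sketch below uses a made-up data frame (toy, with hypothetical columns band, gender, and ratio) and layer_data() to pull out the computed coordinates.

```r
library(ggplot2)

# Hypothetical toy data: two genders in each of two age bands
toy <- data.frame(
  band   = c("20-29", "20-29", "30-39", "30-39"),
  gender = c("F", "M", "F", "M"),
  ratio  = c(0.8, 1.2, 1.0, 1.5)
)

base <- ggplot(toy, aes(x = band, y = ratio, fill = gender))

stacked <- base + geom_bar(stat = "identity", position = "stack")
filled  <- base + geom_bar(stat = "identity", position = "fill")
dodged  <- base + geom_bar(stat = "identity", position = "dodge")

# layer_data() returns the computed bar coordinates for a layer
top_stacked <- max(layer_data(stacked)$ymax)  # tallest stack: 1.0 + 1.5
top_filled  <- max(layer_data(filled)$ymax)   # every stack is rescaled to total 1
top_dodged  <- max(layer_data(dodged)$ymax)   # tallest single bar
```

For ratios such as actual to expected, "dodge" is often the easiest to read, because stacking adds together values that are not naturally additive.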

scatter2

line2

stackbar2

fillbar2

dodgebar2

scatterline2

Choosing which geometry is right for your use case comes down to several subjective factors. There are some general rules for which types of plots are best for combinations of different variable types. It is good to understand the reasoning behind these. However, the flexibility of R makes it possible to quickly iterate through several chart types.
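As one possible way to iterate quickly, the geometries can be kept in a named list and added to the same base object in a loop. This is only a sketch: toy, its columns, and the list names are made up for illustration, and group = gender is set explicitly so that lines connect within each gender on a discrete x axis.

```r
library(ggplot2)

# Hypothetical toy data: three age bands for each of two genders
toy <- data.frame(
  band   = rep(c("20-29", "30-39", "40-49"), times = 2),
  gender = rep(c("F", "M"), each = 3),
  ratio  = c(0.8, 1.0, 1.1, 1.2, 1.5, 1.3)
)

base <- ggplot(toy, aes(x = band, y = ratio, color = gender,
                        fill = gender, group = gender))

# Candidate geometries to compare
geoms <- list(
  scatter = geom_point(),
  line    = geom_line(),
  bar     = geom_bar(stat = "identity", position = "dodge")
)

# One plot per geometry, all sharing the same data and aesthetics
plots <- lapply(geoms, function(g) base + g)
```

Printing plots$line or plots$bar then shows each candidate in turn.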

I will now demonstrate the TASK 2 code in an R script.

TASK 3: TARGET VARIABLE BY 3+ EXPLANATORY VARIABLES

Faceting is a very powerful layer you can add to look at an additional dimension. There are two faceting functions: facet_wrap() and facet_grid(). Let’s start with facet_wrap().

Task 3 from our homework looks at actual to expected by banded age x gender x smoker. As usual the first step is to create our ggplot object.

graph.task3 <- ggplot(data.grouped3,aes(x = banded.age, y = atoe7580d, color = Gender, fill = Gender))

FACET_WRAP()

Next we add our geometries like we have in the previous tasks. We also add an additional layer to our ggplot object. We add a facet_wrap() to display experience split by levels of the variable smoker.

linepoint3 <- graph.task3 + geom_line() + geom_point() + facet_wrap(~smoker)
linepoint3

The graph above demonstrates facet wrap by smoker status: facet_wrap(~smoker). This results in a copy of your graph for every level of smoker status.
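The same idea can be checked on a small made-up data set. In the sketch below, toy and its columns are hypothetical, and ggplot_build() is used only to inspect how many panels the facet produces.

```r
library(ggplot2)

# Hypothetical toy data: every combination of band, gender, and smoker
toy <- expand.grid(
  band   = c("20-29", "30-39", "40-49"),
  gender = c("F", "M"),
  smoker = c("N", "S"),
  stringsAsFactors = FALSE
)
toy$ratio <- seq(0.8, by = 0.05, length.out = nrow(toy))  # made-up values

p <- ggplot(toy, aes(x = band, y = ratio, color = gender, group = gender)) +
  geom_line() +
  geom_point() +
  facet_wrap(~smoker)

# The panel layout has one row per facet panel
panels <- ggplot_build(p)$layout$layout
```

Here panels has two rows, one for each level of smoker.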

FACET_GRID()

Next we will look at facet_grid(). Because facet_grid(var1 ~ var2) takes 2 variables, we are able to add one more explanatory variable than we could with facet_wrap(~smoker).

To demonstrate, we added Preferred Status as a grouping variable in Task 4 earlier in this lesson. Let’s visualize this relationship.

As always, start with your ggplot object. The aesthetics are the same as we used in Task 3.

graph.task4 <- ggplot(data.grouped4,aes(x = banded.age, y = atoe7580d, color = Gender, fill = Gender))

Next we add our additional layers.

linepoint4 <- graph.task4 + geom_line() + geom_point() + facet_grid(Preferred~smoker) + xlab("Banded Ages") + ylab("Actual to Expected") + ggtitle("Dollar Weighted Actual to Expected Analysis Using 7580E")
linepoint4

As you can see above, the rows correspond to the levels of variable 1 (Preferred Status) and the columns correspond to the levels of variable 2 (Smoker Status).

Each facet shows one combination of these levels. For instance, the top left facet in our example above consists of only policies that are BOTH non-smoker and non-preferred.
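To see the row and column arrangement directly, the sketch below builds a small facet_grid() plot on made-up data and inspects the panel layout. The data frame toy and its columns preferred and smoker are hypothetical stand-ins for the lesson's variables.

```r
library(ggplot2)

# Hypothetical toy data: every combination of the four grouping variables
toy <- expand.grid(
  band      = c("20-29", "30-39", "40-49"),
  gender    = c("F", "M"),
  smoker    = c("N", "S"),
  preferred = c("No", "Yes"),
  stringsAsFactors = FALSE
)
toy$ratio <- seq(0.8, by = 0.02, length.out = nrow(toy))  # made-up values

p <- ggplot(toy, aes(x = band, y = ratio, color = gender, group = gender)) +
  geom_line() +
  geom_point() +
  facet_grid(preferred ~ smoker)

# ROW follows the left-hand variable, COL follows the right-hand variable
panels <- ggplot_build(p)$layout$layout
```

With two levels on each side of the formula, panels describes a 2 x 2 grid: four panels, each holding one combination of preferred and smoker.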

The code above also added a few additional layers to our graph:

  • Used xlab() to create the X label.
  • Used ylab() to create the Y label.
  • Used ggtitle() to create the chart title.

It is very simple to iterate through basic graphs and add more layers as you move closer to production.

I will now demonstrate the TASK 3 code in an R script.

TASK 4: PLOTLY

It is very simple to use plotly with ggplot graphs. Pass any ggplot object to ggplotly():

ggplotly(graph.object)
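A minimal sketch of the conversion, using made-up data in place of the lesson's graph objects:

```r
library(ggplot2)
library(plotly)

# Hypothetical toy data
toy <- data.frame(x = 1:3, y = c(2, 4, 3))

# Build a static ggplot object as usual
p <- ggplot(toy, aes(x, y)) + geom_point()

# Convert it to an interactive plotly widget (hover, zoom, pan)
interactive <- ggplotly(p)
```

The static object p is left unchanged, so both versions remain available for a report.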

I will demonstrate all of the graphs we have created this week - both with and without plotly. You can see these graphs in the appendix below.

SUMMARY

This lesson was meant to serve as an introduction to data visualization in R. The best graphs are often ones that offer stakeholders new insight into a business problem. This often comes from applying innovative visualization types to existing problems. The graphing possibilities with R are nearly unlimited. The more you explore this potential, the more effective you will be at applying R to your actuarial tasks. We will begin to explore more of this potential in this week’s homework and next week’s lesson.

TASK 5: HOMEWORK

Use only the 2015 table, with both count and dollar weighted ratios. Look at how mortality actual to expected rates are affected by the following:

Use the same dataset that we used for our weeks 1 and 2 lessons.

Use any combination of aesthetics and geometries to visualize the relationship between these 4 explanatory variables and the target variable (actual to expected rates). You should use the dplyr library for your data prep, and use the ggplot2 and plotly libraries for your data visualization. I recommend you use a combination of the geometries we went over in this week’s lesson, as well as others you discover on your own (Google will be a big help here).

We will use these graphs along with your insights in next week’s lesson: DOCUMENTING YOUR RESULTS WITH R MARKDOWN

A summary of all the graphs we created today can be found in the appendix below. Each graph is shown first without plotly, then with plotly. This html document was created using only the packages we have covered in lessons 1 and 2 - along with Markdown - which we will cover next week.